15 research outputs found

    An Efficient Implementation of Parallel Parametric HRTF Models for Binaural Sound Synthesis in Mobile Multimedia

    The growing use of mobile multimedia devices in applications such as gaming, 3D video and audio reproduction, immersive teleconferencing, and virtual and augmented reality demands efficient algorithms and methodologies. All these applications require real-time spatial audio engines capable of handling intensive signal processing operations while facing constraints on computational cost, latency and energy consumption. Most mobile multimedia devices include a Graphics Processing Unit (GPU) that is primarily used to accelerate video processing tasks and provides high computational capabilities thanks to its inherently parallel architecture. This paper describes a scalable parallel implementation of a real-time binaural audio engine for GPU-equipped mobile devices. The engine is based on a set of head-related transfer functions (HRTFs) modelled with a parametric parallel structure, allowing efficient synthesis and interpolation while reducing the storage required for the HRTF data. Several strategies to optimize the GPU implementation are evaluated on a well-known processor family present in a wide range of mobile devices. In this context, we analyze both the energy consumption and the real-time capabilities of the system by exploring different GPU and CPU configuration alternatives. Moreover, the implementation uses the OpenCL framework, guaranteeing the portability of the code.
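    As a rough illustration of the kind of structure described above, the CUDA sketch below runs one ear's HRTF as a bank of parallel second-order sections, one thread per section, over a block of samples; the coefficient layout, section count, state handling and the kernel name parallel_hrtf_block are illustrative assumptions, not the paper's OpenCL code.

    __global__ void parallel_hrtf_block(const float* __restrict__ x,  // input block [nFrame]
                                        float* y,                     // ear output [nFrame], pre-zeroed
                                        const float* coef,            // [nSec][5] = b0 b1 b2 a1 a2
                                        float* state,                 // [nSec][2] persistent z1, z2
                                        int nSec, int nFrame)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per second-order section
        if (s >= nSec) return;

        float b0 = coef[s * 5 + 0], b1 = coef[s * 5 + 1], b2 = coef[s * 5 + 2];
        float a1 = coef[s * 5 + 3], a2 = coef[s * 5 + 4];
        float z1 = state[s * 2], z2 = state[s * 2 + 1];

        for (int n = 0; n < nFrame; ++n) {               // transposed direct form II biquad
            float out = b0 * x[n] + z1;
            z1 = b1 * x[n] - a1 * out + z2;
            z2 = b2 * x[n] - a2 * out;
            atomicAdd(&y[n], out);                       // sum the parallel sections into the ear signal
        }
        state[s * 2] = z1;                               // keep filter state across audio blocks
        state[s * 2 + 1] = z2;
    }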

    Evaluating the soft error sensitivity of a GPU-based SoC for matrix multiplication

    System-on-chip (SoC) devices can combine low-power multicore processors with a small graphics accelerator (GPU), offering a trade-off between computational capacity and low power consumption. In this work we use the LLFI-GPU fault injection tool on one of these devices to compare the soft-error sensitivity of two different CUDA versions of a matrix multiplication benchmark. Specifically, we perform fault injection campaigns on a Jetson TK1 development kit, a board equipped with a SoC that includes an NVIDIA "Kepler" Graphics Processing Unit (GPU). We evaluate how modifying the problem size and the thread-block size affects the behaviour of the algorithms. Our results show that the block version of the matrix multiplication benchmark, which leverages the shared memory of the GPU, is not only faster than the element-wise version but also much more resilient to soft errors. We also use the cuda-gdb debugger to analyze the main causes of the crashes produced by soft errors. Our experiments show that most of these errors are due to accesses to invalid positions in the different memories of the GPU, and that the block version suffers a higher percentage of this kind of error.
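    For reference, the two strategies compared above typically look like the following CUDA sketch: a naive element-wise kernel that fetches every operand from global memory, and a block (tiled) kernel that stages sub-matrices in shared memory. The tile size and the assumption that n is a multiple of TILE are illustrative; this is not the benchmark's actual code.

    #define TILE 16

    __global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n || col >= n) return;
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];       // every operand read from global memory
        C[row * n + col] = acc;
    }

    __global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
        __shared__ float As[TILE][TILE];                  // tiles staged in on-chip shared memory
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {              // assumes n is a multiple of TILE
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];  // operands reused from shared memory
            __syncthreads();
        }
        C[row * n + col] = acc;
    }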

    Evaluating the computational performance of the Xilinx UltraScale+ EG Heterogeneous MPSoC

    The emergent Multi-Processor System-on-Chip (MPSoC) technology, which combines heterogeneous computing with the high performance of Field Programmable Gate Arrays (FPGAs), is a very interesting platform for a huge number of applications ranging from medical imaging and augmented reality to high-performance computing in space. In this paper, we focus on the Xilinx Zynq UltraScale+ EG Heterogeneous MPSoC, which is composed of four different processing elements (PEs): a dual-core Cortex-R5, a quad-core ARM Cortex-A53, a graphics processing unit (GPU) and a high-end FPGA. Making proper use of the heterogeneity and the different levels of parallelism of this platform is a challenging task. This paper evaluates the computational performance of the platform and of each of its PEs on fundamental operations. To this end, we evaluate image-based applications and a matrix multiplication kernel. The image-based applications leverage the heterogeneity of the MPSoC and strategically distribute their tasks among both kinds of CPU cores and the FPGA, whereas for the matrix multiplication kernel we analyze each PE separately using different benchmarks to assess and compare their performance in terms of MFlops. These kinds of operations are carried out, for example, in a large number of space-related applications, where MPSoCs are currently gaining momentum. The results show that the different PEs can collaborate efficiently to accelerate the computationally demanding tasks of an application. Another important finding is that, by leveraging the parallel OpenBLAS library, we achieve up to 12 GFlops with the four Cortex-A53 cores of the platform, which is considerable performance for this kind of device.
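    A minimal sketch of the kind of measurement behind the 12 GFlops figure, assuming the CBLAS interface shipped with OpenBLAS and its openblas_set_num_threads() extension; the matrix size and the timing harness are illustrative, not the paper's benchmark.

    #include <cblas.h>        // OpenBLAS CBLAS header (also declares openblas_set_num_threads)
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1024;                                    // illustrative matrix size
        std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);

        openblas_set_num_threads(4);                           // use the four Cortex-A53 cores

        auto t0 = std::chrono::steady_clock::now();
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A.data(), n, B.data(), n, 0.0f, C.data(), n);
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("SGEMM: %.2f GFlops\n", 2.0 * n * n * n / secs / 1e9);  // 2*n^3 flops per GEMM
        return 0;
    }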

    On the performance of a GPU-based SoC in a distributed spatial audio system

    Many current system-on-chip (SoC) devices combine low-power multicore processors with a small graphics accelerator (GPU), offering a trade-off between computational capacity and low power consumption. In this context, spatial audio methods such as wave field synthesis (WFS) can benefit from a distributed system composed of several SoCs that collaborate to tackle the high computational cost of rendering virtual sound sources. This paper evaluates important aspects of a distributed WFS implementation running over a network of Jetson Nano boards, each built around an embedded GPU-based SoC: computational performance, energy efficiency and synchronization. Our results show that maximum efficiency is obtained when the WFS system operates the GPU at 691.2 MHz, achieving 11 sources per watt. Synchronization experiments using the NTP protocol show that a maximum initial delay of 10 ms between nodes does not prevent the system from achieving high spatial sound quality.
    This work has been supported by the Spanish Government through TIN2017-82972-R and ESP2015-68245-C4-1-P, the Valencian Regional Government through PROMETEO/2019/109, and the Universitat Jaume I through UJI-B2019-36.
    Belloch, J. A.; Badía, J. M.; Larios, D. F.; Personal, E.; Ferrer Contreras, M.; Fuster Criado, L.; Lupoiu, M., et al. (2021). On the performance of a GPU-based SoC in a distributed spatial audio system. The Journal of Supercomputing (Online), 77(7):6920-6935. https://doi.org/10.1007/s11227-020-03577-4
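    As a rough sketch of the core rendering step such a distributed WFS node performs on its embedded GPU, the CUDA kernel below lets each thread block handle one (virtual source, loudspeaker) pair, adding the delayed and weighted source signal into that loudspeaker's buffer. The 2D geometry, the 1/sqrt(r) weighting and the kernel name wfs_render are illustrative assumptions; the paper's driving function and buffering scheme are not reproduced here.

    __global__ void wfs_render(const float* __restrict__ src,  // [nSrc][nFrame] source blocks
                               float* out,                     // [nSpk][nFrame] loudspeaker blocks
                               const float2* srcPos,           // virtual source positions (m)
                               const float2* spkPos,           // loudspeaker positions (m)
                               int nSrc, int nSpk, int nFrame, float fs)
    {
        int spk = blockIdx.x;                        // one block per (loudspeaker, source) pair
        int s   = blockIdx.y;
        if (spk >= nSpk || s >= nSrc) return;

        float dx = spkPos[spk].x - srcPos[s].x;
        float dy = spkPos[spk].y - srcPos[s].y;
        float d  = sqrtf(dx * dx + dy * dy);

        const float c = 343.0f;                      // speed of sound (m/s)
        int   delay = (int)(d / c * fs);             // propagation delay in samples
        float gain  = 1.0f / fmaxf(sqrtf(d), 1e-3f); // simple 1/sqrt(r) attenuation

        for (int n = threadIdx.x; n < nFrame; n += blockDim.x) {
            int m = n - delay;                       // index into the current source block
            float xval = (m >= 0) ? src[s * nFrame + m] : 0.0f;  // earlier blocks ignored in this sketch
            atomicAdd(&out[spk * nFrame + n], gain * xval);      // mix into the loudspeaker buffer
        }
    }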

    Hybridization and adaptive evolution of diverse Saccharomyces species for cellulosic biofuel production

    Additional file 15. Summary of whole genome sequencing statistics

    Multicore implementation of a multichannel parallel graphic equalizer

    Numerous signal processing applications are emerging on mobile computing systems. These applications are subject to responsiveness constraints for user interactivity and, at the same time, must be optimized for energy efficiency. Many current embedded devices are built around low-power multicore processors that offer a good trade-off between computational capacity and low power consumption. In this context, equalizers are widely used in mobile applications such as music streaming to adjust the levels of bass and treble in sound reproduction. In this study, we evaluate a graphic equalizer from the audio, computational-capacity and energy-efficiency perspectives, as well as the execution of multiple real-time equalizers on the embedded quad-core processor of a mobile device. To this end, we experiment with the working frequencies and with the parallelism that can be extracted from a quad-core ARM Cortex-A57. Results show that, using high CPU frequencies and three or four cores, our parallel algorithm is able to equalize more than five channels per watt in real time with an audio buffer of 4096 samples, which implies a latency of 92.8 ms at the standard sample rate of 44.1 kHz.
    Funding for open access charge: CRUE-Universitat Jaume I.
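    As a sanity check on the figures above, the quoted latency is one buffer period at the given sample rate: 4096 samples / 44,100 samples per second ≈ 0.09288 s, i.e. the reported 92.8 ms.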

    Comparison of Parallel Implementation Strategies in GPU-Accelerated System-on-Chip Under Proton Irradiation

    Commercial off-the-shelf (COTS) systems-on-chip (SoCs) are becoming widespread in embedded systems. Many of them include a multicore central processing unit (CPU) and a high-end graphics processing unit (GPU), combining high computational performance with low power consumption and flexible multilevel parallelism. This kind of device is also being considered for radiation environments where large amounts of data must be processed or compute-intensive applications must be executed. In this article, we compare three different strategies to perform matrix multiplication on the GPU of a Tegra TK1 SoC. Our aim is to analyze how the use of the GPU resources influences not only the computational performance of the algorithm but also its radiation sensitivity. Radiation experiments with protons were performed to compare the behavior of the three strategies. Experimental results show that most of the errors force a reboot of the platform. The number of errors is directly related to how the algorithms use the internal memories of the GPU and increases with the matrix size. It is also related to the number of transactions with the global memory, which in our experiments was not affected by the radiation. Results show that the smallest cross section is obtained with the fastest algorithm, even though it uses the GPU cores more intensively.
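    For context, the cross section used as the figure of merit in this kind of test is usually estimated as σ ≈ N_errors / Φ, where N_errors is the number of observed errors and Φ is the proton fluence received during the test (particles/cm²), so σ is expressed in cm² and a smaller value means fewer errors for the same amount of radiation.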

    Reliability Evaluation of LU Decomposition on GPU-Accelerated System-on-Chip Under Proton Irradiation

    Graphics processing units (GPUs) have become a basic accelerator both in high-performance nodes and in low-power systems-on-chip (SoCs). They provide massive data parallelism and very high performance per watt. However, their reliability in harsh environments is an important issue, especially for safety-critical applications. In this article, we evaluate the influence of the parallelization strategy on the reliability of lower–upper (LU) decomposition on a GPU-accelerated SoC under proton irradiation. Specifically, we compare a memory-bound and a compute-bound implementation of the decomposition on a K20A GPU embedded in a Tegra K1 (TK1) SoC. We tune the GPU and CPU clock frequencies both to highlight the radiation sensitivity of the GPU running the benchmark and to let both algorithms solve problems of the same size while exposed to the same radiation dose. Results show that a more intensive use of the GPU resources increases the cross section. We also observed that most of the radiation-induced errors hang the operating system and even the rebooting process. Finally, we present a preliminary study of the error propagation of the LU decomposition algorithms.
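    As a rough illustration of what a memory-bound (unblocked) GPU LU factorization looks like, the CUDA sketch below performs a right-looking decomposition without pivoting, launching one column-scaling and one trailing-update kernel per step; kernel names, launch configuration and the absence of pivoting are illustrative assumptions, and the paper's compute-bound blocked variant is not reproduced.

    #include <cuda_runtime.h>

    __global__ void scale_column(float* A, int n, int k) {
        int i = k + 1 + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) A[i * n + k] /= A[k * n + k];          // L(i,k) = A(i,k) / U(k,k)
    }

    __global__ void update_trailing(float* A, int n, int k) {
        int i = k + 1 + blockIdx.y * blockDim.y + threadIdx.y;
        int j = k + 1 + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && j < n)
            A[i * n + j] -= A[i * n + k] * A[k * n + j];  // rank-1 update of the trailing submatrix
    }

    void lu_nopivot(float* dA, int n) {                   // dA: n x n matrix in device memory
        for (int k = 0; k < n - 1; ++k) {
            int m = n - k - 1;                            // rows/cols below and right of the pivot
            scale_column<<<(m + 255) / 256, 256>>>(dA, n, k);
            dim3 thr(16, 16), blk((m + 15) / 16, (m + 15) / 16);
            update_trailing<<<blk, thr>>>(dA, n, k);
        }
        cudaDeviceSynchronize();
    }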

    The NED foundation experience: A model of global neurosurgery

    Introduction: The Neurosurgery Education and Development (NED) Foundation (NEDF) began developing local neurosurgical practice in Zanzibar (Tanzania) in 2008. More than a decade later, multiple actions with humanitarian purposes have significantly improved neurosurgical practice and education for physicians and nurses. Research question: To what extent can comprehensive interventions (beyond treating patients) be effective in developing global neurosurgery from the outset in low- and middle-income countries? Material and method: A retrospective review of a 14-year period (2008-2022) of NEDF activities was carried out, highlighting landmarks, projects and evolving collaborations in Zanzibar. We propose a particular model, the NEDF model, with interventions in the field of health cooperation that simultaneously aim to equip, treat and educate in a stepwise manner. Results: 138 neurosurgical missions with 248 NED volunteers have been reported. At the NED Institute, between November 2014 and November 2022, 29,635 patients were seen in the outpatient clinics and 1,985 surgical procedures were performed. Over the course of NEDF's projects, we have identified three different levels of complexity (1, 2 and 3) spanning the areas of equipment ("equip"), healthcare ("treat") and training ("educate"), facilitating an increase in autonomy throughout the process. Discussion and conclusion: In the NEDF model, the interventions required in each action area (ETE) are coherent for each level of development (1, 2 and 3). When applied simultaneously, they have a greater impact. We believe the model can be equally useful for developing other medical and/or surgical specialties in other low-resource healthcare settings.